Preamble: running this code

There’s a lot of packages being loaded here! If you don’t have any of these already and want to run this code, then you’ll need these. If you want to embed the interactive LDA visualizations using the custom functions discussed below, you’ll also need to get the development version of LDAvis even if you downloaded the regular CRAN version earlier. The development version fixes an issue with the topics being numbered differently in the visualization vs in R. If you want to run the BERTopic model in Python then you’ll also need to install pandas, nltk, and bertopic.

Trying to compile this entire RMD is probably going to be very slow and maybe won’t work at all. So instead of doing that, I would recommend you just take out the chunks you really want to use and run them interactively. In general, the dictionary based stuff is going to run much faster than the topic models.

#### note that eval = false for this chunk - only run it if you need it! 

#### additional quanteda packages: 
install.packages('quanteda.textmodels')
install.packages('quanteda.textplots')

#### development version of LDAvis - fixes a problem with topic ordering
devtools::install_github("cpsievert/LDAvis")
####additional dictionaries: 

devtools::install_github("kbenoit/quanteda.dictionaries") 
remotes::install_github("quanteda/quanteda.sentiment")
###Seeded LDA 

install.packages("seededlda")
### Structural topic model 
install.packages("stm")
### additional package for viewing STM output: 
install.packages("stminsights")

###Keyword LDA
install.packages("KeyATM")

Setup: cleaning the data

Since most of these models are going to use the “bag-of-words” assumption, we can do the same cleaning steps for each. The code below should be familiar to you already, and draws from a corpus of news stories scraped from CNN AND Fox News from 2020 through September 2023. The “text” column contains the text, and the “source” column contains the source. There are also dates associated with each article.

library(tidyverse)  # for data manipulation
library(quanteda)  # for cleaning and analysis of text
library(quanteda.textstats) # extra quanteda commands
library(quanteda.textplots) # for plotting the keyness statistic


poldf<-read_csv('https://github.com/Neilblund/APAN/raw/main/news_sample.csv')

# normalizing some apostrophes and quotes (surprisingly important!)
poldf$text<-str_replace_all(poldf$text,"\u201C|\u201D", '"')%>%
  str_replace_all(.,"\u2019|\u2018", "'")
# convert to a corpus
pol_corpus<-poldf%>%
  with(.,corpus(text, 
                docvars = data.frame(url, headline,  date, source)))

# tokenize the texts
pol_tokens<-tokens(pol_corpus, 
                   what = 'word',
                   # this keeps characteristics about the documents like the headline and url
                   remove_punct =TRUE,
                   remove_symbols = TRUE,
                   remove_numbers = TRUE,
                   remove_url = TRUE,
                   remove_separators = TRUE,
                   include_docvars = TRUE
                   )
# remove stop words 
toks_nostop <- tokens_select(pol_tokens, pattern = stopwords("en"), selection = "remove")
# convert to a document-feature matrix
pol_dfm<- dfm(toks_nostop)
# stem and then unstem words (using custom function in preamble to this document)
pol_dfm<-stem_unstem(pol_dfm) 
# if you don't want to use the stem_unstem function, run this instead:  
# pol_dfm<-dfm_wordstem(dfm)  

# remove features that occur less than 10 times
pol_dfm<-dfm_trim(pol_dfm, min_docfreq= 10)
# remove features that occur in more than 10% of documents 
# increasing max_docfreq may improve results, but it will take longer for models to converge
pol_dfm<- dfm_trim(pol_dfm, max_docfreq = 0.1, docfreq_type = "prop")

Keywords in context

KWIC stands for “keywords in context”. We can use KWIC to find stories that contain a term, and get the text surrounding that word. This code would pull all documents containing terms that started with “immig” (so: immigrant, immigration, immigrants etc.) along with a window of 3 words on either side.

(note that this is actually pulling from the pol_tokens object rather than the DFM, this approach doesn’t really require any cleaning to use)

# keywords in context
kw_immig <- kwic(pol_tokens, pattern =  "immig*", window=3)
# view the first 10
head(kw_immig, 5)

For your own work, you may want to get more context by increasing the window argument in the KWIC command. You might also consider writing these results to a csv file so you can read them in a spreadsheet. KWIC doesn’t automatically provide you with nice looking plots, but it can be a very good way to do things like develop a dictionary or get a general qualitative sense of how sources are discussing something.

Keyness

Keyness is a measure of relative frequency that can be useful for finding terms that are more common in particular subsets of documents. For instance, if we want to find terms that are more common in Fox News articles compared to CNN, we could just run the following:

tstat_key <- textstat_keyness(pol_dfm, 
                              # find distinctive terms for Fox news compared to CNN :
                              target = docvars(pol_dfm)$source == "Fox News")
textplot_keyness(tstat_key)

We could also focus on a subset of articles to see how sources differ in their coverage of similar topics. Here, we’re looking at keywords that are strongly associated with documents that mention Hunter Biden. The results suggest that Fox and CNN have some very distinct terminology when discussing this issue:

hunter_dfm<-dfm_subset(pol_dfm, 
                       #  get a subset of articles that mention Hunter Biden
                       subset  = str_detect(pol_corpus, "Hunter Biden"))
tstat_key <- textstat_keyness(hunter_dfm, 
                              
                              target = docvars(hunter_dfm)$source == "Fox News")
textplot_keyness(tstat_key)

Dictionary Methods

We can use the dictionary approach to simply match terms against a dictionary of terms we’re interested in. The policy agendas dictionary contains a list of terms associated with major policy areas. Here, we apply the agendas codebook to the tokens list, which removes all terms that aren’t part of the dictionary.

# download the policy agendas codebook (discussed here #https://www.almendron.com/tribuna/wp-content/uploads/2017/05/CAP2013v2.pdf)
policyAgendas<-readRDS(url('https://github.com/Neilblund/APAN/raw/main/policy_agendas_dictionary.rds'))

immig_tokens<-tokens_subset(pol_tokens, 
                       #  get a subset of articles that mention Hunter Biden
                       subset  = str_detect(pol_corpus, "immigr*"))

# convert to a quanteda dictionary format
policyagendas.dict <- dictionary(policyAgendas)

# apply the dictionary
agenda_politics <- tokens_lookup(pol_tokens, dictionary = policyagendas.dict, levels = 1)

# check out the first result: 
agenda_politics[[1]]
##  [1] "healthcare"            "healthcare"            "healthcare"           
##  [4] "finance"               "healthcare"            "healthcare"           
##  [7] "healthcare"            "healthcare"            "healthcare"           
## [10] "healthcare"            "intl_affairs"          "healthcare"           
## [13] "healthcare"            "crime"                 "healthcare"           
## [16] "healthcare"            "healthcare"            "healthcare"           
## [19] "healthcare"            "healthcare"            "healthcare"           
## [22] "healthcare"            "healthcare"            "healthcare"           
## [25] "healthcare"            "land-water-management"

We can convert this list of tokens to a document feature matrix. A DFM will have one row per document and one column per feature, the cells will contain counts of the number of times a given feature occurs in each document. In this case, our DFM will have one column for each of the topics in the policy agendas dictionary. We can use this to do things like count the number of documents which are about some topic.

# convert to a dfm
agenda_dfm<-dfm(agenda_politics)
# look at the first few rows
head(agenda_dfm, n=5)
## Document-feature matrix of: 5 documents, 28 features (82.86% sparse) and 4 docvars.
##        features
## docs    macroeconomics civil_rights healthcare agriculture forestry labour
##   text1              0            0         22           0        0      0
##   text2              0            0         10           0        0      0
##   text3              0            0          3           0        0      1
##   text4              1            0          1           0        0      1
##   text5              0            0          1           0        0      0
##        features
## docs    immigration education environment energy
##   text1           0         0           0      0
##   text2           0         0           0      0
##   text3           0         0           0      0
##   text4           0         0           0      0
##   text5           0         0           0      0
## [ reached max_nfeat ... 18 more features ]

textstat_frequency gives us some basic stats about our document features. The “frequency” column is the number of times a policy agenda topic occurs. The “docfreq” column is the number of documents which contain that agenda topic (regardless of whether it occurs multiple times in a single document). From this, we can see that “healthcare” is the most commonly occurring agenda topic.

agenda_freq<-textstat_frequency(agenda_dfm)
head(agenda_freq)

Put them all in a horizontal barplot:

agenda_freq%>%
  ggplot(aes(x = reorder(feature, docfreq), y=docfreq)) + 
  geom_bar(stat='identity', fill='lightblue') +
  xlab('policy agenda') + 
  ylab('number of documents mentioning') +
  coord_flip() +
  # using the minimal theme: 
  theme_minimal() +
  ggtitle('Document frequency of policy agenda terms in CNN politics articles')

Additional dictionaries/commands

Quanteda does have some additional dictionaries that may be of use for some of you. These can be accessed through the Quanteda.dictionaries package and the quanteda.sentiment package. Although these don’t appear to be on CRAN yet, you can install them by running the command below, and then loading them the way you would load any other library. The associated github page has information on the included dictionary files.

devtools::install_github("kbenoit/quanteda.dictionaries") 
remotes::install_github("quanteda/quanteda.sentiment")

The textstat_polarity function will automatically apply a dictionary and measure sentence polarity by getting the log of positive/negative terms in the text. The textstat_valence function can be used to compute scores for dictionaries that have multiple categories rather than just positive vs. negative.

Here’s an example of how you might use the textstat_polarity command from this package along with the Lexicoder Sentiment Dictionary. In this example, we’re measuring the document polarity, then graphing the average sentiment for documents for each source and depending on whether they mention immigration. The results suggest that Fox News tends to give more negative coverage when discussing immigration compared to their coverage of other topics, where as CNN covers immigration with a slightly more positive valence.

library(quanteda.sentiment)
# load the lexicoder sentiment dictionary
lsd<-quanteda.sentiment::data_dictionary_LSD2015

polarity<-textstat_polarity(pol_corpus, dictionary=lsd)%>%
  # add polarity to existing document variables
  bind_cols(docvars(pol_corpus))%>%
  # also add a column for mentions of immigration
  mutate(mentions_immigration = str_detect(pol_corpus,"immigr*"))

# plot polarity for all four groups
ggplot(polarity, aes(x=source, y=sentiment, fill=mentions_immigration)) + 
  geom_boxplot(notch=TRUE) +
  theme_minimal()

# test as an interaction term:
## summary(lm(sentiment ~ source * mentions_immigration, data=polarity))

Making a custom dictionary

Finally, keep in mind that you can create a dictionary from a named list. So if you want to create your own for a specific domain or language, or extend an existing one, the syntax should be reasonably straightforward.

custom_list<-list(COVID  =c('covid','coronavirus', 'vaccin*', 'lock down'),
                  Jan6 = c( "jan* 6*", "capitol", 'insurrection*'))



custom_dict <- dictionary(custom_list)
tok<-tokens_lookup(pol_tokens, dictionary=custom_dict)


table(unlist(tok))

Any of these approaches are a good starting point and I think you can definitely use them for your analyses, but you should be willing to admit their limitations when applied outside the area they were designed for. Check out this paper for some discussion of the limitations of dictionary based approaches.

Topic Modeling

LDA

Latent Dirichlet Allocation (LDA) also known as topic modeling is widely used approach for unsupervised document classification. “Unsupervised” in this case, means that you don’t need any prior knowledge about the corpus in order to make use of it. Instead, an LDA model will attempt to model documents as as a probability distribution over topics, and topics as a probability distribution over words. This is done automatically - all you really need to do is clean the text and choose a number of topics. That said, this also means that the results can be somewhat hard to interpret: you’ll have to make an educated guess about what topics you’re finding by looking at words that occur with a higher frequency in a given topic.

Note: If you’re looking for a quick mostly non-technical primer on this, I think Edwin Chen’s blog post is a good starting point, but there’s a lot of information about these online. If you’re looking for a deeper dive, here’s a good review article on the method and recent innovations.

Here’s an example of creating an LDA model with just 15 topics.

(If you want replicable result, make sure to run set.seed([some number]) right before you train the topic model. While topic models will theoretically converge toward a single “correct” answer, the numbering of the topics is completely arbitrary in standard LDA, so topic 10 might be topic 2 next time you run it.)

library(seededlda) # package for both seeded LDA and regular LDA
library(LDAvis) # for visualization 
# number of topics: 
K<-15

# the initialization is random, so set the random number seed for replicability 
set.seed(100)
lda_model<-textmodel_lda(pol_dfm, k = K, 
              max_iter=2000, 
              auto_iter = T,
              
              # raising this result in more uniform theta
              alpha = 0.1,
              #raising this will result in a more uniform phi
              beta = 0.01
              
              )

If you view lda_model$phi you should see a matrix with 15 rows (one row per topic) and around 8000 columns (one column per term). The number in each row represent the probability of that term for the topic. (you’ll sometimes see people refer to this matrix using the Greek letter \(\phi\) or as the “topic-word distribution”)

You can see the highest probability words in each topic using the command below, you’ll probably be able to pick out some concepts pretty quickly

#top 10 terms per topic
terms(lda_model, 10)%>%
  #converting to data frame (for prettier formatting)
  data.frame()

Meanwhile, the lda_model$theta represents the topic-word distribution matrix. If you view this object you should see one column per topic and one row per document. The numbers indicate the probability of a topic in a given document. (you’ll often see this distribution denoted as \(\theta\) or as the “document-topic matrix”)

# get the document-topic distributions for the first few articles: 
head(lda_model$theta)%>%
  data.frame()

One use of the the document topic distribution is for information retrieval. Based on the most likely terms, topic 13 appears to be picking up on stories related to the Supreme Court’s decision overturning Roe v. Wade. So documents with a high % of this topic should probably be about that story. And based on the urls, this intuition seems correct:

    #get document data
dobbs_docs<-docvars(pol_dfm)%>%
  
  # add % of topic 13 to document data
  mutate(dobbs = lda_model$theta[,4])%>%
  # sort from lowest to highest topic %
  arrange(-dobbs)


dobbs_docs%>%
  select(url)%>%
  slice_head(n=10)

We can use this insight to get some sense of how the volume of coverage on this topic has changed over time. Here, we’re getting the monthly average % of this topic within each document. The dobbs_timeline object just allows us to add a set of annotations to the timeline.

# make a timeline of events (so we can add annotations to the timeline)
dobbs_timeline<-data.frame(event= c('Ginsburg death',
                                    'Cert granted',
                                    'Oral arguments',
                                    'Dobbs leak', 
                                    'Dobbs opinion'),
                date = as.Date(c("2020-09-18",
                                 '2021-05-17',
                                 '2021-12-01',
                                 '2022-05-02', 
                                 '2022-06-24'))
                  
                  )%>%
  
  # convert timeline to months (to allow joining with monthly coverage data)
  mutate(month = floor_date(date, unit='months'))

monthly<-dobbs_docs%>%
  # round down to the nearest month
  mutate(month = floor_date(date, unit='months')
         )%>%
  
  # group by month and compute the monthly average: 
  group_by(month)%>%
  summarise(coverage = mean(dobbs)
            )%>%
  # join with timeline data
  left_join(dobbs_timeline)

And from here, we’ll create the plot in GGplot:

# create plot: 
monthly%>%
  ggplot(aes(x=month, y=coverage, label=event)) + 
  # create the line plot
  geom_line() +
  # add points for each month
  geom_point() +
  # turn off clipping (avoids removing labels if they're outside the plot area)
  coord_cartesian(clip = "off") + 
  
  # add labels (vjust -1 sets annotation just above the timeline)
  geom_label(vjust=-1) +
  ylab('Average topic %') +
  # using the minimal theme: 
  theme_minimal() +
  ggtitle('Average percent topic 13 (Roe overturned)')  +
  # show ticks every 3 months
  scale_x_date(breaks="3 month", date_labels = "%Ob '%y") +
  # labels at 45 degree angle
  theme(axis.text.x = element_text(angle = 45))

Lastly, you can use the LDAvis package to visualize and interpret your topics.

The left panel uses a principal component analysis to model the “distance” between each topic. Topics that are further apart are less similar to one another, and larger bubbles indicate the topic is more common. Selecting a topic will cause the right panel to display relevant terms for that topic. Note: clicking the links in the bottom right will bring up more detailed information on what these metrics mean, but the short version is that they’re both meant to make it easier to identify the most distinctive terms for each topic, rather than the terms that have the highest frequency.

In order to (hopefully) simplify the process of setup, there’s a custom function included in the header of this notebook that will allow you to save and embed these visualizations. Here’s the code you would use to make this display on your local device:

# a filename to save the plot under
modelvis<-'lda_vis.html'
# create the visualization
LDAvis_standalone(lda_model, outputfile=modelvis)
# bring up a window with the visualization: 
browseURL(modelvis)

You could take screenshots of this visualization for different topics in order to share it, but if you wanted to share the actual interactive version of this visualization in an Rmarkdown file or notebook, you could do it using the includeHTML function from the htmltools package.

lda_vis<-'lda_visual.html'
LDAvis_standalone(lda_model, outputfile=lda_vis)
htmltools::includeHTML(lda_vis)

Seeded LDA

In seeded LDA, you supply some initial terms and topics, and the model will infer additional terms based on your words.

Here I’ve created a custom dictionary with 10 topics and just a handful of terms for each. Notice that terms with * will be able to match different endings, so vaccin* should be able to match any of vaccinate, vaccine, and vaccination.

news_dict<-list('covid' = c('covid*', 'vaccin*', 'lockdown*'),
                'natsec' = c('afghan*', 'taliban', 'putin', 'china', 'ukrain*','zelensky', 'terror*'),

                'elections' = c('election', 'caucus', 'primary', 'debate*', 'nomination', 'midterm', 'vote*', 'poll*'),
                'trump' = c('indictment', 'impeachment', 'mar-a-lago'),
                'jan6' = c("capitol", 'january', 'chansley','fraud'),
                'blm'  = c('floyd' , 'black', 'police', "protest*"),
                'immigration' = c('immigrat*',  'border', 'homeland', 'migrant*'),
                'climate' = c('global','climate*','fuel', 'carbon', 'emission*'),
                'abortion' = c('abort*',  'roe', 'wade', 'fetus', 'dobbs'),
                'economy' = c('inflation','price*', 'tax')
                )
# conver to a quanteda dictionary
dict_topic<-dictionary(news_dict)
set.seed(100) # setting random number seed
slda_model <- textmodel_seededlda(pol_dfm, 
                                 dictionary = dict_topic, 
                                 auto_iter = T, 
                                 residual = 5, # 5 residual topics for garbage
                                 max_iter=2000)

Again, we can extract topics from this model using the terms function. You’ll notice that there are topics that correspond to the categories in the dictionary, but there are lots of words that weren’t initially given to the model. One useful application of this approach is it can help you come up additional terms when constructing a dictionary.

terms(slda_model)%>%
  #converting to a data frame (just to make it look nicer in the notebook)
  data.frame()

And, just like with regular LDA, we can use the inferred topics to do things like track news coverage over time (note that the columns now have names based on our dictionary) Here’s how coverage of Covid-19 topic steadily dropped since 2020.

covid_df<-
  docvars(pol_dfm)%>%
  # add % of covid topic
  mutate(covid = slda_model$theta[,'covid'],
           # round down to the nearest month
         month = floor_date(date, unit='months'))%>%
  # group by month and compute the monthly average: 
  group_by(month)%>%
  summarise(covid = mean(covid))
covid_df%>%
  ggplot(aes(x=month, y=covid)) + 
  # create the line plot
  geom_line() +
  geom_point() +
  #add a marker for the start of Floyd protests:

  ylab('Average topic %') +
  # using the minimal theme: 
  theme_minimal() +
  ggtitle('Average % covid topic by month') +
  scale_x_date(breaks="3 month", date_labels = "%Ob '%y") +
  # labels at 45 degree angle
  theme(axis.text.x = element_text(angle = 45))

Lastly, the same method we used for LDAVis above should work here (unfortunately this doesn’t pick up on the topic names)

slda_vis<-'slda_vis.html'
LDAvis_standalone(slda_model, outputfile=slda_vis)
htmltools::includeHTML(slda_vis)

Structural Topic Models

Unlike traditional LDA, Structural Topic Models (paper) allow users to include additional document-level data when inferring their topics. The “prevalence model” will allow topic probabilities to vary systematically according to a characteristic of the document (such as the author or source), the “content model” will allow the term-probabilities within each topic to vary in the same way. For instance: if we think that different sources are more likely to talk about certain issues, we can explicitly model that information as part of the inference process. Here, we’re using the source variable (Fox News vs. CNN) as a predictor for the topic probabilities. The assumption (a reasonably safe one) is that the two outlets will differ systematically in the sorts of news they cover.

library(stm) # for structural topic models
# convert DFM to a format usable by STM
dfm2stm <- convert(pol_dfm, to = "stm")
# number of topics
K<-15
# random number seed 
set.seed(1000)
# run the model
stm_model <- stm(dfm2stm$documents, dfm2stm$vocab, 
                 K = K, 
                 data = dfm2stm$meta, 
                 # include source in the prevalence model
                 prevalence =  ~ source,
                 init.type = "Spectral",
                 # set verbose =T if you want to watch the model progress
                 verbose = F 
                 )

The STM package includes a special labelTopics function to retrieve the terms associated with each topic. While the highest probability terms can be a good way to identify topics, it can be somewhat unhelpful because sometimes terms show up at the topic of the list because they’re just common words in general. The FREX, lift, and score metrics all attempt to account for term frequency in some way in order to give greater weight to terms that are more distinct to each topic.

labelTopics(stm_model, topics=1:5, n=5) # note - only showing first 5 of 15 topics here
## Topic 1 Top Words:
##       Highest Prob: vaccine, testing, mask, virus, mandate 
##       FREX: vaccine, mask, virus, fauci, newsom 
##       Lift: niaid, stay-at-home, unvaccinated, allergy, azar 
##       Score: vaccine, mask, virus, fauci, cdc 
## Topic 2 Top Words:
##       Highest Prob: impeachment, prosecutors, trial, pence, indictment 
##       FREX: jury, indictment, giuliani, trial, insurrection 
##       Lift: jurors, clemency, fani, sidney, cipollone 
##       Score: indictment, riley, prosecutors, impeachment, giuliani 
## Topic 3 Top Words:
##       Highest Prob: pennsylvania, seat, carolina, endorsed, scott 
##       FREX: runoff, mail-in, hampshire, warnock, fetterman 
##       Lift: elissa, laxalt, nrsc, outraised, outspent 
##       Score: warnock, kemp, absentee, runoff, mail-in 
## Topic 4 Top Words:
##       Highest Prob: protests, obama, man, george, video 
##       FREX: floyd, protests, kennedy, confederate, bush 
##       Lift: portraits, song, confederate, knee, nfl 
##       Score: portraits, floyd, confederate, lady, protests 
## Topic 5 Top Words:
##       Highest Prob: cuomo, resign, mayor, sexual, santos 
##       FREX: santos, cuomo, ocasio-cortez, omar, blasio 
##       Lift: boylan, derosa, santos, tara, blasio 
##       Score: santos, cuomo, ocasio-cortez, blasio, sexual

You can view stm_model$theta to get the document-topic probabilities. But STM also includes the findThoughts command to help identify documents that are representative of a given topic. Here, I’m getting a document that is representative of topic 6, which appears to be related to immigration and border control issues

findThoughts(stm_model, 
             # list of texts (the dimnames part here ensures that the doc names line up with the names in the DFM)
             texts = pol_corpus[pol_dfm@Dimnames[[1]]], 
             # topic 6
             topics = 6 ,
             n = 1)
## 
##  Topic 6: 
##       An investigator generalreport found that the U.S. Environmental Protection Agency (EPA) "improperly" doled out and managed millions of dollars worth of taxpayer dollars in information technologycontracts between 2017 and 2019. The EPA's Office of Inspector General reported Wednesday that the agency had spent $52.5 million in taxpayer money on the deals, which the watchdog had investigated following a complaint about "contract and bidding irregularities with three major information technology contracts." The report said the EPA violated "Federal Acquisition Regulation requirements and contract clauses" when they "purchased 23 pieces of hardware and software equipment" through an "expiring" IT contract with Canadian IT company CGI Federal. "This purchase was outside the scope of the contract and was ultimately never used for that contract," wrote the IG's office. "The EPA then improperly solicited bids for one of two subsequent contracts and transferred the equipment to use on the new contract."  "By approving the purchase, the EPA improperly spent $641,680 in federal funds," the report continued. The report went on to reveal that the EPA "issued task orders" under the three contracts "without approval" from the agency's CIO –which the watchdog says is "required under the Federal Information Technology Acquisition Reform Act (FITARA)." These tasks orders "resulted in the EPA spending $52.5 million in taxpayer funds without proper approvals," the report read. "The agency also mismanaged these contracts with respect to monitoring property and licenses," the report continued. "For example, the EPA underreported and incorrectly identified purchased equipment in the agency's property reporting system and did not record $1.18 million in software licenses in the agency's asset management system."  The IG said in the conclusion of the report that the EPA "CIO was unaware of significant acquisitions made by the agency and was unable to appropriately review price competitions, cost reductionsand duplicate equipment." "As a result, the EPA spent $52.5 million in taxpayer dollars without the CIO's oversight," the IG concluded. The watchdog recommended that the agency's assistant administrator for mission support "[r]eview all active contracts for acquisitions of information technology hardware, softwareand services in fiscal year 2016 and later" to see if the "required" FITARA approvals had been "obtained." The IG also called on the assistant administrator to "identify cost findings in the process from hardware and software purchases that were either duplicates or unnecessary" and to "[r]equire contracting officers to maintain records" of CIO approvals, "including those under" FITARA. When pressed for comment, a spokesperson for the EPA pointed Fox News to their response in the IG report, where the agency said it "agreed" with the IG's recommendations in the report.  The confirmation ofPresident Joe Biden's pick to lead the EPA, Michael Regan, is up for a vote in the Senate this week.

The more important innovation here is that we can estimate the effect of a covariate on the topic prevalence. Here, I’m estimating the effect of source on the prevalence of all 15 topics (although I’m only showing the results for topic 6)

fx <- estimateEffect(1:15 ~ source , stm_model,
                        meta = dfm2stm$meta)
# only showing topic 6 (remove the $tables[[6]] part to see all topics)
summary(fx)$tables[[6]]
##                   Estimate  Std. Error  t value     Pr(>|t|)
## (Intercept)    0.046126691 0.003498798 13.18358 6.987575e-39
## sourceFox News 0.008347122 0.004820188  1.73170 8.340409e-02

We can also plot the results (although this is very messy for more than a handful of topics). In either case, we see Fox is generally more likely to cover the immigration topic compared to CNN.

plot.estimateEffect(fx, 
                    covariate = "source", 
                    topics = 1:6,
                    # value of fox news - cnn
                    cov.value1 = "Fox News",
                    cov.value2  = "CNN",
                    method='difference',
                    # you'll need to adjust this to keep the text from getting cut off 
                  xlim  = c(-.1, .1)
                    )

There’s also a (slightly janky) package called stminsights that will allow you to do some of this interactively. This isn’t something you can really share unless you upload it to a server somewhere, but you can use it to generate and then download some plots which you could use elsewhere.

# for some reason, the dfm data must be named "out" or else this doesn't work. 
out<-dfm2stm
# save the relevant model outputs: 
save(file='stm_model.RData', list = c('stm_model', 'out', 'fx', 'pol_corpus'))
library(stminsights)
# go to browse and find stm_model.Rdata to get this running 
run_stminsights()

Keyword Assisted Topic Models

Keyword Assisted Topic models (see: paper) are a new approach that allows both document covariates and seed terms to define topics. In short: it combines the SeededLDA approach with the Structural Topic Model. This is a pretty new method and the package is still under development, but I’m including it here because it might be of interest for those of you who are already using dictionary based methods.

The setup for the basic model is almost identical to seeded LDA, the main practical difference is that it automatically infers the \(\alpha\) and \(\beta\) parameters that control how diffuse the \(\theta\) and \(\phi\) are, which can save you the effort of having to tune these parameters.

We can use the same dictionary we implemented before. The keyATM package also provides some additional options for checking the quality of our selected keyword - notice that it throws a warning because some of the terms in our keyword list don’t actually appear in the corpus. Some terms were probably too common and got removed when we ran dfm_trim, other terms may just be too rare to be useful.

Although it will make the model run slower, you might want to increase max_docfreq when you clean the data.

library(keyATM) # keyword assisted topic models
# our news dictionary
news_dict<-list('covid' = c('covid*', 'vaccin*', 'lockdown*'),
                'natsec' = c('afghan*', 'taliban', 'putin', 
                             'china', 'ukrain*','zelensky', 'terror*'),
                'elections' = c('election', 'caucus', 'primary', 'debate*', 
                                'nomination', 'midterm', 'vote*', 'poll*'),
                'trump' = c('indictment', 'impeachment', 'mar-a-lago'),
                'jan6' = c("capitol", 'january', 'chansley','fraud'),
                'blm'  = c('floyd' , 'black', 'police', "protest*", 'blm','black-lives-matter'),
                'immigration' = c('immigrat*',  'border', 'homeland', 'migrant*'),
                'climate' = c('global','climate*','fuel', 'carbon', 'emission*'),
                'abortion' = c('abort*',  'roe', 'wade', 'fetus', 'dobbs'),
                'economy' = c('inflation','price*', 'tax')
                )

# converted to a quanteda dictionary
dict_topic<-dictionary(news_dict)
# convert the pol_dfm to a structure preferred by this package
keyATM_docs<-keyATM_read(pol_dfm)
# convert the quanteda dictionary to a format used by keyATM 
keys<-read_keywords(docs = keyATM_docs, dictionary = dict_topic)

#visualize keywords: 
key_viz<-visualize_keywords(docs = keyATM_docs, keywords = keys)
key_viz

And here’s the model training function. You’ll probably notice that this is quite a bit slower than either the standard or seeded LDA. You might want to get this running and go grab a snack. The authors recommend saving the model after it is finished running, so I’m going to do that here. To load it back up, you would just need to run `klda_model <- readRDS('klda.rds')'.

klda_model <- keyATM(
  docs              = keyATM_docs,    # text input
  no_keyword_topics = 5,              # 5 extra garbage topics
  keywords          = keys,       # our list of terms
  model             = "base",         # the base model
  
  options           = list(seed = 250) # set the random number seed inside the function
)

# save the trained model
saveRDS(klda_model, file='klda.rds')

You can get the highest probability words using the top_words function. The resulting dataframe will have a checkmark next to terms that were part of your keyword list.

top_words(klda_model)

Once again, we can use this to retrieve specific documents (and to check the quality of our topics)

election_df<-docvars(pol_dfm)%>%
  # add % of covid topic
  mutate(election= klda_model$theta[,'3_elections'])%>%
  arrange(-election)

election_df%>%
  select(url)%>%
  slice_head(n=10)

And we can use this to track news coverage over time. Here, we see peaks in the “elections” topic that roughly coincide with presidential general election seasons in the time period.

election_coverage<-election_df%>%
  mutate(# round down month
         month = floor_date(date, unit='months'))%>%
  # group by month and compute the monthly average: 
  group_by(month)%>%
  summarise(election= mean(election))

election_coverage%>%
  ggplot(aes(x=month, y=election)) + 
  # create the line plot
  geom_line() +
  geom_point() +
  ylab('Average topic %') +
  # using the minimal theme: 
  theme_minimal() +
  ggtitle('Average % election topic by month') 

Adding metadata

The real feature that distinguishes Keyword assisted topic models from seeded LDA is the ability to incorporate document meta data to predict topic prevalence a very similar fashion to the structural topic model. The syntax is here slightly different, but the basic intuition here is the same as in STM: we think that Fox News and CNN will vary systematically in the amount of attention they give to different subjects. (note: this will take a while to converge)

klda_covariate <- keyATM(
  docs              = keyATM_docs,
  no_keyword_topics = 5,
  keywords          = keys,
  model             = "covariates",
  model_settings    = list(covariates_data    = docvars(pol_dfm),
                           covariates_formula = ~ source),
  options           = list(seed = 250)
)

# save the results after running 
saveRDS(klda_covariate, file='klda_covariate.rds')

We can the top words using the same methods as before. It’s possible that the results here will have improve somewhat because the topic model is now incorporating data on the source into the metadata. However, the main finding of interest will likely come from viewing the impact of the covariates on the topic probabilities. We can retrieve these with by_strata_DocTopic() and plot them using the command below. Here, we’re plotting the prevalence for the immigration topic. The results suggest that Fox News is significantly more likely to cover this issue (at least in this sample)

#sometimes covariates will be renamed automatically
#use covariates_info(klda_covariate) to see how your covariates might have been renamed
strata_topic <- by_strata_DocTopic(
  klda_covariate, 
  # "source" here was automatically changed to sourceFox News
  by_var = "sourceFox News",
  labels = c("CNN", "Fox")
)

plot(strata_topic, var_name = 'source', show_topic= 7)

BERTopic

Lastly, while probably outside the scope of this course, I do want to highlight a final variation of the topic model known as BERTopic paper. Rather than inferring topics using a document-term-matrix, BERTopic uses an embedding model - a kind of numeric representation of text that captures semantic similarities between words, sentences or documents (good primer on this concept here). Importantly, we can use pre-existing embeddings that were trained on a huge corpus and apply them to a new set of documents, which gives us a major head-start when trying to train a model.

BERTopic generally works better on paragraphs or even single sentences, so we are going to use the same dataset here, but only take the first 5 sentences from each article. Other than that, we really don’t need to do any pre-cleaning of the data because the sentence embedding model we’re applying here was trained on regular human sentences and can use word order and context clues when producing numeric representations of sentences.

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# read the corpus
poldocs = pd.read_csv('https://github.com/Neilblund/APAN/raw/main/news_sample.csv')

# split the text into sentences 
sentences = [sent_tokenize(text) for text in poldocs.text]
# get the first three sentences from each text
sentences = [' '.join(x[:4]) for x in sentences]


# apply a pre-trained embedding model on the sentences
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(sentences, show_progress_bar=False)


topic_model = BERTopic(embedding_model = embedding_model)

# train the model
topics, probs = topic_model.fit_transform(sentences, embeddings)
 

# save the results in the working directory
topic_model.save("./bertmodel", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

The bertopic package offers a number of options for visualizing our results. The first plot shows an inter-topic distance map (similar to the kind provided by LDAVis). Notably: the BERTopic model inferred a much larger number of topics (~60) here than what we set in previous iterations, and many appear to reflect a much finer level of detail.

from bertopic import BERTopic

# loading the saved model
loaded_model = BERTopic.load("./bertmodel")

# visualize topic similarities:
loaded_model.visualize_topics()
loaded_model.visualize_barchart(topics = [ 1,3, 4, 14, 17] )
# getting topic counts by source
classes = r.poldf.source
topics_per_class = loaded_model.topics_per_class(r.poldf.text, classes=classes)

# abortion, ukraine, hunter biden, gender, and jan 6
loaded_model.visualize_topics_per_class(topics_per_class, topics = [ 1,3, 4, 14, 17], normalize_frequency= True)

Other stuff to look at

Many of these examples were adapted from the Quanteda documentation. The package documentation is here, and there is (also a tutorial here. One important area of text analysis not covered here are fully supervised models, where you take a set of labeled documents and train a model to predict a set of unlabeled documents. While useful and an important part of the text analysis field more broadly, it these models typically require a lot of hand coding before they really begin to be effective. That said, the quanteda tutorial above does go into some detail on how to train a basic text classifier in R.

If you’re really interested in going in depth, one of the best sources on this is Dan Jurafsky who makes slides, lectures, and even an entire textbook freely available online.

---
title: "Additional text analysis information"
output:
  html_document:
    df_print: paged
    toc: true
    toc_float: true
    collapsed: false
    code_download: true
    code_folding: show
    

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning =FALSE, message=FALSE, cache = T)
LDAvis_standalone <- function(model, outputfile) {
  require(LDAvis)
  # method for WarpLDA
  if(class(model)[1] == "WarpLDA"){
    json = createJSON(
      phi = model$.__enclos_env__$private$topic_word_distribution_with_prior(),
      theta = model$.__enclos_env__$private$doc_topic_distribution_with_prior(),
      doc.length = model$.__enclos_env__$private$doc_len,
      vocab = model$.__enclos_env__$private$vocabulary,
      term.frequency = colSums(model$.__enclos_env__$self$components),
      reorder.topics = FALSE
      
    )
  }
  # method for textmodel_lda
  if(class(model)[1] == 'textmodel_lda'){
    json<-createJSON(
      phi = model$phi,
      theta = model$theta,
      doc.length = rowSums(model$data),
      vocab = colnames(model$data),
      term.frequency = colSums(model$data),
      reorder.topics = FALSE)
  }
  
  if(class(model)[1] == "keyATM_output"){
    json = createJSON(
      phi = model$phi,
      theta = model$theta,
      doc.length = model$doc_lens, 
      vocab = model$vocab,
      term.frequency = model$word_counts,
      reorder.topics = FALSE
    )
    
  }
  
  
  id_name<-paste0('eldavis_', gsub("[^0-9]", "", as.numeric(Sys.time())), collapse="")
  
  
  cat(
    sprintf(
      '<link rel="stylesheet" type="text/css" href="https://cdn.jsdelivr.net/gh/bmabey/pyLDAvis@3.4.0/pyLDAvis/js/ldavis.v1.0.0.css">

<div id="%s" style="background-color:white;"></div>
    <script type="text/javascript">
    var ldavis_data = %s

function LDAvis_load_lib(url, callback){
  var s = document.createElement(\'script\');
  s.src = url;
  s.async = true;
  s.onreadystatechange = s.onload = callback;
  s.onerror = function(){console.warn("failed to load library " + url);};
  document.getElementsByTagName("head")[0].appendChild(s);
}

if(typeof(LDAvis) !== "undefined"){
   // already loaded: just create the visualization
   !function(LDAvis){
       new LDAvis("#" + "%s", ldavis_data);
   }(LDAvis);
}else if(typeof define === "function" && define.amd){
   // require.js is available: use it to load d3/LDAvis
   require.config({paths: {d3: "https://d3js.org/d3.v5"}});
   require(["d3"], function(d3){
      window.d3 = d3;
      LDAvis_load_lib("https://cdn.jsdelivr.net/gh/bmabey/pyLDAvis@3.4.0/pyLDAvis/js/ldavis.v3.0.0.js", function(){
        new LDAvis("#" + "%s", ldavis_data);
      });
    });
}else{
    // require.js not available: dynamically load d3 & LDAvis
    LDAvis_load_lib("https://d3js.org/d3.v5.js", function(){
         LDAvis_load_lib("https://cdn.jsdelivr.net/gh/bmabey/pyLDAvis@3.4.0/pyLDAvis/js/ldavis.v3.0.0.js", function(){
                 new LDAvis("#" + "%s", ldavis_data);
            })
         });
}
</script>', id_name ,paste(json), id_name, id_name, id_name), file  =  outputfile)}

#custom function for stemming and then unstemming a dfm
stem_unstem<-function(dfm, matchby='frequency'){
 if(matchby == 'shortest'){
    vocab<-colnames(dfm)[order(nchar(colnames(dfm)), decreasing=T)]
 }
 if(matchby == 'frequency'){
     vocab<-names(sort(featfreq(dfm), decreasing=T))

 }
  stemmed<-char_wordstem(vocab, check_whitespace=FALSE)
  dfm<-dfm_replace(dfm, vocab, stemmed)
  
  dfm<- dfm_replace(dfm, featnames(dfm), vocab[match(featnames(dfm), stemmed)])
  return(dfm)
}
      
```


# Preamble: running this code

There's a lot of packages being loaded here! If you don't have any of these already and want to run this code, then you'll need these. If you want to embed the interactive LDA visualizations using the custom functions discussed below, you'll also need to get the development version of `LDAvis` even if you downloaded the regular CRAN version earlier. The development version fixes an issue with the topics being numbered differently in the visualization vs in R. If you want to run the BERTopic model in Python then you'll also need to install `pandas`, `nltk`, and `bertopic`. 

Trying to compile this entire RMD is probably going to be very slow and maybe won't work at all. So instead of doing that, I would recommend you just take out the chunks you really want to use and run them interactively. In general, the dictionary based stuff is going to run much faster than the topic models.  


```{r, eval=FALSE}

#### note that eval = false for this chunk - only run it if you need it! 

#### additional quanteda packages: 
install.packages('quanteda.textmodels')
install.packages('quanteda.textplots')

#### development version of LDAvis - fixes a problem with topic ordering
devtools::install_github("cpsievert/LDAvis")
####additional dictionaries: 

devtools::install_github("kbenoit/quanteda.dictionaries") 
remotes::install_github("quanteda/quanteda.sentiment")
###Seeded LDA 

install.packages("seededlda")
### Structural topic model 
install.packages("stm")
### additional package for viewing STM output: 
install.packages("stminsights")

###Keyword LDA
install.packages("KeyATM")


```

# Setup: cleaning the data

Since most of these models are going to use the "bag-of-words" assumption, we can do the same cleaning steps for each. The code below should be familiar to you already, and draws from a corpus of news stories scraped from CNN AND Fox News from 2020 through September 2023. The "text" column contains the text, and the "source" column contains the source. There are also dates associated with each article. 

```{r}
library(tidyverse)  # for data manipulation
library(quanteda)  # for cleaning and analysis of text
library(quanteda.textstats) # extra quanteda commands
library(quanteda.textplots) # for plotting the keyness statistic


poldf<-read_csv('https://github.com/Neilblund/APAN/raw/main/news_sample.csv')

# normalizing some apostrophes and quotes (surprisingly important!)
poldf$text<-str_replace_all(poldf$text,"\u201C|\u201D", '"')%>%
  str_replace_all(.,"\u2019|\u2018", "'")
# convert to a corpus
pol_corpus<-poldf%>%
  with(.,corpus(text, 
                docvars = data.frame(url, headline,  date, source)))

# tokenize the texts
pol_tokens<-tokens(pol_corpus, 
                   what = 'word',
                   # this keeps characteristics about the documents like the headline and url
                   remove_punct =TRUE,
                   remove_symbols = TRUE,
                   remove_numbers = TRUE,
                   remove_url = TRUE,
                   remove_separators = TRUE,
                   include_docvars = TRUE
                   )
# remove stop words 
toks_nostop <- tokens_select(pol_tokens, pattern = stopwords("en"), selection = "remove")
# convert to a document-feature matrix
pol_dfm<- dfm(toks_nostop)
# stem and then unstem words (using custom function in preamble to this document)
pol_dfm<-stem_unstem(pol_dfm) 
# if you don't want to use the stem_unstem function, run this instead:  
# pol_dfm<-dfm_wordstem(dfm)  

# remove features that occur less than 10 times
pol_dfm<-dfm_trim(pol_dfm, min_docfreq= 10)
# remove features that occur in more than 10% of documents 
# increasing max_docfreq may improve results, but it will take longer for models to converge
pol_dfm<- dfm_trim(pol_dfm, max_docfreq = 0.1, docfreq_type = "prop")
```



# Keywords in context 

KWIC stands for "keywords in context". We can use KWIC to find stories that contain a term, and get the text surrounding that word. This code would pull all documents containing terms that started with "immig" (so: immigrant, immigration, immigrants etc.) along with a window of 3 words on either side. 

(note that this is actually pulling from the ``pol_tokens`` object rather than the DFM, this approach doesn't really require any cleaning to use)

```{r}
# keywords in context
kw_immig <- kwic(pol_tokens, pattern =  "immig*", window=3)
# view the first 10
head(kw_immig, 5)
```

For your own work, you may want to get more context by increasing the ``window`` argument in the KWIC command. You might also consider writing these results to a csv file so you can read them in a spreadsheet. KWIC doesn't automatically provide you with nice looking plots, but it can be a very good way to do things like develop a dictionary or get a general qualitative sense of how sources are discussing something.

#  Keyness

Keyness is a measure of relative frequency that can be useful for finding terms that are more common in particular subsets of documents. For instance, if we want to find terms that are more common in Fox News articles compared to CNN, we could just run the following: 

```{r}

tstat_key <- textstat_keyness(pol_dfm, 
                              # find distinctive terms for Fox news compared to CNN :
                              target = docvars(pol_dfm)$source == "Fox News")
textplot_keyness(tstat_key)

```


We could also focus on a subset of articles to see how sources differ in their coverage of similar topics. Here, we're looking at keywords that are strongly associated with documents that mention Hunter Biden. The results suggest that Fox and CNN have some very distinct terminology when discussing this issue:  

```{r}
hunter_dfm<-dfm_subset(pol_dfm, 
                       #  get a subset of articles that mention Hunter Biden
                       subset  = str_detect(pol_corpus, "Hunter Biden"))
tstat_key <- textstat_keyness(hunter_dfm, 
                              
                              target = docvars(hunter_dfm)$source == "Fox News")
textplot_keyness(tstat_key)

```


# Dictionary Methods

We can use the dictionary approach to simply match terms against a dictionary of terms we're interested in. The policy agendas dictionary contains a list of terms associated with major policy areas. Here, we apply the agendas codebook to the tokens list, which removes all terms that aren't part of the dictionary. 

```{r}
# download the policy agendas codebook (discussed here #https://www.almendron.com/tribuna/wp-content/uploads/2017/05/CAP2013v2.pdf)
policyAgendas<-readRDS(url('https://github.com/Neilblund/APAN/raw/main/policy_agendas_dictionary.rds'))

immig_tokens<-tokens_subset(pol_tokens, 
                       #  get a subset of articles that mention Hunter Biden
                       subset  = str_detect(pol_corpus, "immigr*"))

# convert to a quanteda dictionary format
policyagendas.dict <- dictionary(policyAgendas)

# apply the dictionary
agenda_politics <- tokens_lookup(pol_tokens, dictionary = policyagendas.dict, levels = 1)

# check out the first result: 
agenda_politics[[1]]

```


We can convert this list of tokens to a document feature matrix. A DFM will have one row per document and one column per feature, the cells will contain counts of the number of times a given feature occurs in each document. In this case, our DFM will have one column for each of the topics in the policy agendas dictionary. We can use this to do things like count the number of documents which are about some topic. 


```{r}
# convert to a dfm
agenda_dfm<-dfm(agenda_politics)
# look at the first few rows
head(agenda_dfm, n=5)
```

``textstat_frequency`` gives us some basic stats about our document features. The "frequency" column is the number of times a policy agenda topic occurs. The "docfreq" column is the number of documents which contain that agenda topic (regardless of whether it occurs multiple times in a single document). From this, we can see that "healthcare" is the most commonly occurring agenda topic. 

```{r}
agenda_freq<-textstat_frequency(agenda_dfm)
head(agenda_freq)
```

Put them all in a horizontal barplot: 

```{r agendaplot}
agenda_freq%>%
  ggplot(aes(x = reorder(feature, docfreq), y=docfreq)) + 
  geom_bar(stat='identity', fill='lightblue') +
  xlab('policy agenda') + 
  ylab('number of documents mentioning') +
  coord_flip() +
  # using the minimal theme: 
  theme_minimal() +
  ggtitle('Document frequency of policy agenda terms in CNN politics articles')
```


## Additional dictionaries/commands

Quanteda does have some additional dictionaries that may be of use for some of you. These can be accessed through the [Quanteda.dictionaries](https://github.com/kbenoit/quanteda.dictionaries) package and the [quanteda.sentiment](https://github.com/quanteda/quanteda.sentiment) package. Although these don't appear to be on CRAN yet, you can install them by running the command below, and then loading them the way you would load any other library. The associated github page has information on the included dictionary files. 




```{r, eval=FALSE}
devtools::install_github("kbenoit/quanteda.dictionaries") 
remotes::install_github("quanteda/quanteda.sentiment")
```


The [textstat_polarity](https://rdrr.io/github/quanteda/quanteda.sentiment/man/textstat_polarity.html) function will automatically apply a dictionary and measure sentence polarity by getting the log of positive/negative terms in the text. The [textstat_valence](https://rdrr.io/github/quanteda/quanteda.sentiment/man/textstat_valence.html) function can be used to compute scores for dictionaries that have multiple categories rather than just positive vs. negative.

Here's an example of how you might use the `textstat_polarity` command from this package along with the [Lexicoder Sentiment Dictionary](https://www.snsoroka.com/data-lexicoder/). In this example, we're measuring the document polarity, then graphing the average sentiment for documents for each source and depending on whether they mention immigration. The results suggest that Fox News tends to give more negative coverage when discussing immigration compared to their coverage of other topics, where as CNN covers immigration with a slightly more positive valence. 

```{r}
library(quanteda.sentiment)
# load the lexicoder sentiment dictionary
lsd<-quanteda.sentiment::data_dictionary_LSD2015

polarity<-textstat_polarity(pol_corpus, dictionary=lsd)%>%
  # add polarity to existing document variables
  bind_cols(docvars(pol_corpus))%>%
  # also add a column for mentions of immigration
  mutate(mentions_immigration = str_detect(pol_corpus,"immigr*"))

# plot polarity for all four groups
ggplot(polarity, aes(x=source, y=sentiment, fill=mentions_immigration)) + 
  geom_boxplot(notch=TRUE) +
  theme_minimal()

# test as an interaction term:
## summary(lm(sentiment ~ source * mentions_immigration, data=polarity))

```
## Making a custom dictionary

Finally, keep in mind that you can create a dictionary from a named list. So if you want to create your own for a specific domain or language, or extend an existing one, the syntax should be reasonably straightforward. 


```{r, eval=FALSE}
custom_list<-list(COVID  =c('covid','coronavirus', 'vaccin*', 'lock down'),
                  Jan6 = c( "jan* 6*", "capitol", 'insurrection*'))



custom_dict <- dictionary(custom_list)
tok<-tokens_lookup(pol_tokens, dictionary=custom_dict)


table(unlist(tok))

```


Any of these approaches are a good starting point and I think you can definitely use them for your analyses, but you should be willing to admit their limitations when applied outside the area they were designed for. Check out [this paper](https://www.cambridge.org/core/journals/political-analysis/article/text-as-datathe-promise-and-pitfalls-of-automatic-content-analysis-methods-for-politicaltexts/F7AAC8B2909441603FEB25C156448F20) for some discussion of the limitations of dictionary based approaches. 


# Topic Modeling

## LDA

Latent Dirichlet Allocation (LDA) also known as topic modeling is widely used approach for unsupervised document classification. "Unsupervised" in this case, means that you don't need any prior knowledge about the corpus in order to make use of it. Instead, an LDA model will attempt to model documents as as a probability distribution over topics, and topics as a probability distribution over words. This is done automatically - all you really need to do is clean the text and choose a number of topics. That said, this also means that the results can be somewhat hard to interpret: you'll have to make an educated guess about what topics you're finding by looking at words that occur with a higher frequency in a given topic.

Note: If you're looking for a quick mostly non-technical primer on this, I think Edwin Chen's [blog post](https://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/) is a good starting point, but there's a lot of information about these online. If you're looking for a deeper dive, [here's a good review article](https://link.springer.com/article/10.1007/s11042-018-6894-4) on the method and recent innovations.



Here's an example of creating an LDA model with just 15 topics. 

(If you want replicable result, make sure to run `set.seed([some number])` right before you train the topic model. While topic models will theoretically converge toward a single "correct" answer, the numbering of the topics is completely arbitrary in standard LDA, so topic 10 might be topic 2 next time you run it.)

```{r lda}
library(seededlda) # package for both seeded LDA and regular LDA
library(LDAvis) # for visualization 
# number of topics: 
K<-15

# the initialization is random, so set the random number seed for replicability 
set.seed(100)
lda_model<-textmodel_lda(pol_dfm, k = K, 
              max_iter=2000, 
              auto_iter = T,
              
              # raising this result in more uniform theta
              alpha = 0.1,
              #raising this will result in a more uniform phi
              beta = 0.01
              
              )
```

If you view `lda_model$phi` you should see a matrix with 15 rows (one row per topic) and around 8000 columns (one column per term). The number in each row represent the probability of that term for the topic. (you'll sometimes see people refer to this matrix using the Greek letter $\phi$ or as the "topic-word distribution")

You can see the highest probability words in each topic using the command below, you'll probably be able to pick out some concepts pretty quickly

```{r}

#top 10 terms per topic
terms(lda_model, 10)%>%
  #converting to data frame (for prettier formatting)
  data.frame()

```


Meanwhile, the `lda_model$theta` represents the topic-word distribution matrix. If you view this object you should see one column per topic and one row per document. The numbers indicate the probability of a topic in a given document. (you'll often see this distribution denoted as $\theta$ or as the "document-topic matrix")

```{r}
# get the document-topic distributions for the first few articles: 
head(lda_model$theta)%>%
  data.frame()

```

One use of the the document topic distribution is for information retrieval. Based on the most likely terms, topic 13 appears to be picking up on stories related to the Supreme Court's decision overturning Roe v. Wade. So documents with a high % of this topic should probably be about that story. And based on the urls, this intuition seems correct: 

```{r}
    #get document data
dobbs_docs<-docvars(pol_dfm)%>%
  
  # add % of topic 13 to document data
  mutate(dobbs = lda_model$theta[,4])%>%
  # sort from lowest to highest topic %
  arrange(-dobbs)


dobbs_docs%>%
  select(url)%>%
  slice_head(n=10)



```


We can use this insight to get some sense of how the volume of coverage on this topic has changed over time. Here, we're getting the monthly average % of this topic within each document. The `dobbs_timeline` object just allows us to add a set of annotations to the timeline.

```{r dobbsprep}
# make a timeline of events (so we can add annotations to the timeline)
dobbs_timeline<-data.frame(event= c('Ginsburg death',
                                    'Cert granted',
                                    'Oral arguments',
                                    'Dobbs leak', 
                                    'Dobbs opinion'),
                date = as.Date(c("2020-09-18",
                                 '2021-05-17',
                                 '2021-12-01',
                                 '2022-05-02', 
                                 '2022-06-24'))
                  
                  )%>%
  
  # convert timeline to months (to allow joining with monthly coverage data)
  mutate(month = floor_date(date, unit='months'))

monthly<-dobbs_docs%>%
  # round down to the nearest month
  mutate(month = floor_date(date, unit='months')
         )%>%
  
  # group by month and compute the monthly average: 
  group_by(month)%>%
  summarise(coverage = mean(dobbs)
            )%>%
  # join with timeline data
  left_join(dobbs_timeline)
```

And from here, we'll create the plot in GGplot: 

```{r dobbsplot}
# create plot: 
monthly%>%
  ggplot(aes(x=month, y=coverage, label=event)) + 
  # create the line plot
  geom_line() +
  # add points for each month
  geom_point() +
  # turn off clipping (avoids removing labels if they're outside the plot area)
  coord_cartesian(clip = "off") + 
  
  # add labels (vjust -1 sets annotation just above the timeline)
  geom_label(vjust=-1) +
  ylab('Average topic %') +
  # using the minimal theme: 
  theme_minimal() +
  ggtitle('Average percent topic 13 (Roe overturned)')  +
  # show ticks every 3 months
  scale_x_date(breaks="3 month", date_labels = "%Ob '%y") +
  # labels at 45 degree angle
  theme(axis.text.x = element_text(angle = 45))
```


Lastly, you can use the [LDAvis package](https://github.com/cpsievert/LDAvis) to visualize and interpret your topics. 

The left panel uses a [principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) to model the "distance" between each topic. Topics that are further apart are less similar to one another, and larger bubbles indicate the topic is more common. Selecting a topic will cause the right panel to display relevant terms for that topic. Note: clicking the links in the bottom right will bring up more detailed information on what these metrics mean, but the short version is that they're both meant to make it easier to identify the most distinctive terms for each topic, rather than the terms that have the highest frequency. 

In order to (hopefully) simplify the process of setup, there's a custom function included in the header of this notebook that will allow you to save and embed these visualizations. Here's the code you would use to make this display on your local device:


```{r, eval=FALSE}
# a filename to save the plot under
modelvis<-'lda_vis.html'
# create the visualization
LDAvis_standalone(lda_model, outputfile=modelvis)
# bring up a window with the visualization: 
browseURL(modelvis)

```

You could take screenshots of this visualization for different topics in order to share it, but if you wanted to share the actual interactive version of this visualization in an Rmarkdown file or notebook, you could do it using the `includeHTML` function from the htmltools package. 

```{r ldavis, results='asis'}
lda_vis<-'lda_visual.html'
LDAvis_standalone(lda_model, outputfile=lda_vis)
htmltools::includeHTML(lda_vis)
```

## Seeded LDA 

In seeded LDA, you supply some initial terms and topics, and the model will infer additional terms based on your words. 

Here I've created a custom dictionary with 10 topics and just a handful of terms for each. Notice that terms with `*` will be able to match different endings, so `vaccin*` should be able to match any of vaccinate, vaccine, and vaccination. 

```{r seeded_lda}
news_dict<-list('covid' = c('covid*', 'vaccin*', 'lockdown*'),
                'natsec' = c('afghan*', 'taliban', 'putin', 'china', 'ukrain*','zelensky', 'terror*'),

                'elections' = c('election', 'caucus', 'primary', 'debate*', 'nomination', 'midterm', 'vote*', 'poll*'),
                'trump' = c('indictment', 'impeachment', 'mar-a-lago'),
                'jan6' = c("capitol", 'january', 'chansley','fraud'),
                'blm'  = c('floyd' , 'black', 'police', "protest*"),
                'immigration' = c('immigrat*',  'border', 'homeland', 'migrant*'),
                'climate' = c('global','climate*','fuel', 'carbon', 'emission*'),
                'abortion' = c('abort*',  'roe', 'wade', 'fetus', 'dobbs'),
                'economy' = c('inflation','price*', 'tax')
                )
# conver to a quanteda dictionary
dict_topic<-dictionary(news_dict)
set.seed(100) # setting random number seed
slda_model <- textmodel_seededlda(pol_dfm, 
                                 dictionary = dict_topic, 
                                 auto_iter = T, 
                                 residual = 5, # 5 residual topics for garbage
                                 max_iter=2000)
```

Again, we can extract topics from this model using the `terms` function. You'll notice that there are topics that correspond to the categories in the dictionary, but there are lots of words that weren't initially given to the model. One useful application of this approach is it can help you come up additional terms when constructing a dictionary. 


```{r sldaterms}
terms(slda_model)%>%
  #converting to a data frame (just to make it look nicer in the notebook)
  data.frame()
```


And, just like with regular LDA, we can use the inferred topics to do things like track news coverage over time (note that the columns now have names based on our dictionary) Here's how coverage of Covid-19 topic steadily dropped since 2020. 


```{r covidplot}

covid_df<-
  docvars(pol_dfm)%>%
  # add % of covid topic
  mutate(covid = slda_model$theta[,'covid'],
           # round down to the nearest month
         month = floor_date(date, unit='months'))%>%
  # group by month and compute the monthly average: 
  group_by(month)%>%
  summarise(covid = mean(covid))
covid_df%>%
  ggplot(aes(x=month, y=covid)) + 
  # create the line plot
  geom_line() +
  geom_point() +
  #add a marker for the start of Floyd protests:

  ylab('Average topic %') +
  # using the minimal theme: 
  theme_minimal() +
  ggtitle('Average % covid topic by month') +
  scale_x_date(breaks="3 month", date_labels = "%Ob '%y") +
  # labels at 45 degree angle
  theme(axis.text.x = element_text(angle = 45))
  
```


Lastly, the same method we used for LDAVis above should work here (unfortunately this doesn't pick up on the topic names)


```{r sldavis, result='asis'}
slda_vis<-'slda_vis.html'
LDAvis_standalone(slda_model, outputfile=slda_vis)
htmltools::includeHTML(slda_vis)
```

### Structural Topic Models

Unlike traditional LDA, Structural Topic Models [(paper)](https://scholar.harvard.edu/files/dtingley/files/topicmodelsopenendedexperiments.pdf) allow users to include additional document-level data when inferring their topics. The "prevalence model" will allow topic probabilities to vary systematically according to a characteristic of the document (such as the author or source), the "content model" will allow the term-probabilities within each topic to vary in the same way. For instance: if we think that different sources are more likely to talk about certain issues, we can explicitly model that information as part of the inference process. Here, we're using the source variable (Fox News vs. CNN) as a predictor for the topic probabilities. The assumption (a reasonably safe one) is that the two outlets will differ systematically in the sorts of news they cover. 

```{r}
library(stm) # for structural topic models
# convert DFM to a format usable by STM
dfm2stm <- convert(pol_dfm, to = "stm")
# number of topics
K<-15
# random number seed 
set.seed(1000)
# run the model
stm_model <- stm(dfm2stm$documents, dfm2stm$vocab, 
                 K = K, 
                 data = dfm2stm$meta, 
                 # include source in the prevalence model
                 prevalence =  ~ source,
                 init.type = "Spectral",
                 # set verbose =T if you want to watch the model progress
                 verbose = F 
                 )


```


The STM package includes a special `labelTopics` function to retrieve the terms associated with each topic. While the highest probability terms can be a good way to identify topics, it can be somewhat unhelpful because sometimes terms show up at the topic of the list because they're just common words in general. The FREX, lift, and score metrics all attempt to account for term frequency in some way in order to give greater weight to terms that are more distinct to each topic. 

```{r}

labelTopics(stm_model, topics=1:5, n=5) # note - only showing first 5 of 15 topics here

```

You can view `stm_model$theta` to get the document-topic probabilities. But STM also includes the `findThoughts` command to help identify documents that are representative of a given topic. Here, I'm getting a document that is representative of topic 6, which appears to be related to immigration and border control issues 
```{r,R.options=list(max.print=10)}
findThoughts(stm_model, 
             # list of texts (the dimnames part here ensures that the doc names line up with the names in the DFM)
             texts = pol_corpus[pol_dfm@Dimnames[[1]]], 
             # topic 6
             topics = 6 ,
             n = 1)
```


The more important innovation here is that we can estimate the effect of a covariate on the topic prevalence. Here, I'm estimating the effect of source on the prevalence of all 15 topics (although I'm only showing the results for topic 6)

```{r}
fx <- estimateEffect(1:15 ~ source , stm_model,
                        meta = dfm2stm$meta)
# only showing topic 6 (remove the $tables[[6]] part to see all topics)
summary(fx)$tables[[6]]

```

We can also plot the results (although this is very messy for more than a handful of topics). In either case, we see Fox is generally more likely to cover the immigration topic compared to CNN.

```{r}
plot.estimateEffect(fx, 
                    covariate = "source", 
                    topics = 1:6,
                    # value of fox news - cnn
                    cov.value1 = "Fox News",
                    cov.value2  = "CNN",
                    method='difference',
                    # you'll need to adjust this to keep the text from getting cut off 
                  xlim  = c(-.1, .1)
                    )
```

There's also a (slightly janky) package called [stminsights](https://cran.r-project.org/web/packages/stminsights/vignettes/intro.html) that will allow you to do some of this interactively. This isn't something you can really share unless you upload it to a server somewhere, but you can use it to generate and then download some plots which you could use elsewhere. 

```{r, eval=FALSE}
# for some reason, the dfm data must be named "out" or else this doesn't work. 
out<-dfm2stm
# save the relevant model outputs: 
save(file='stm_model.RData', list = c('stm_model', 'out', 'fx', 'pol_corpus'))
library(stminsights)
# go to browse and find stm_model.Rdata to get this running 
run_stminsights()
```


## Keyword Assisted Topic Models

Keyword Assisted Topic models [(see: paper)](https://imai.fas.harvard.edu/research/files/keyATM.pdf) are a new approach that allows both document covariates and seed terms to define topics. In short: it combines the SeededLDA approach with the Structural Topic Model. This is a pretty new method and the package is still under development, but I'm including it here because it might be of interest for those of you who are already using dictionary based methods. 

The setup for the basic model is almost identical to seeded LDA, the main practical difference is that it automatically infers the $\alpha$ and $\beta$ parameters that control how diffuse the $\theta$ and $\phi$ are, which can save you the effort of having to tune these parameters. 

We can use the same dictionary we implemented before. The keyATM package also provides some additional options for checking the quality of our selected keyword - notice that it throws a warning because some of the terms in our keyword list don't actually appear in the corpus. Some terms were probably too common and got removed when we ran `dfm_trim`, other terms may just be too rare to be useful. 

Although it will make the model run slower, you might want to increase  `max_docfreq` when you clean the data. 

```{r}
library(keyATM) # keyword assisted topic models
# our news dictionary
news_dict<-list('covid' = c('covid*', 'vaccin*', 'lockdown*'),
                'natsec' = c('afghan*', 'taliban', 'putin', 
                             'china', 'ukrain*','zelensky', 'terror*'),
                'elections' = c('election', 'caucus', 'primary', 'debate*', 
                                'nomination', 'midterm', 'vote*', 'poll*'),
                'trump' = c('indictment', 'impeachment', 'mar-a-lago'),
                'jan6' = c("capitol", 'january', 'chansley','fraud'),
                'blm'  = c('floyd' , 'black', 'police', "protest*", 'blm','black-lives-matter'),
                'immigration' = c('immigrat*',  'border', 'homeland', 'migrant*'),
                'climate' = c('global','climate*','fuel', 'carbon', 'emission*'),
                'abortion' = c('abort*',  'roe', 'wade', 'fetus', 'dobbs'),
                'economy' = c('inflation','price*', 'tax')
                )

# converted to a quanteda dictionary
dict_topic<-dictionary(news_dict)
# convert the pol_dfm to a structure preferred by this package
keyATM_docs<-keyATM_read(pol_dfm)
# convert the quanteda dictionary to a format used by keyATM 
keys<-read_keywords(docs = keyATM_docs, dictionary = dict_topic)

#visualize keywords: 
key_viz<-visualize_keywords(docs = keyATM_docs, keywords = keys)
key_viz
```


And here's the model training function. You'll probably notice that this is quite a bit slower than either the standard or seeded LDA. You might want to get this running and go grab a snack. The authors recommend saving the model after it is finished running, so I'm going to do that here. To load it back up, you would just need to run ``klda_model  <- readRDS('klda.rds')'`. 

```{r}
klda_model <- keyATM(
  docs              = keyATM_docs,    # text input
  no_keyword_topics = 5,              # 5 extra garbage topics
  keywords          = keys,       # our list of terms
  model             = "base",         # the base model
  
  options           = list(seed = 250) # set the random number seed inside the function
)

# save the trained model
saveRDS(klda_model, file='klda.rds')
```

You can get the highest probability words using the ``top_words`` function. The resulting dataframe will have a checkmark next to terms that were part of your keyword list. 

```{r}
top_words(klda_model)
```


Once again, we can use this to retrieve specific documents (and to check the quality of our topics)

```{r}
election_df<-docvars(pol_dfm)%>%
  # add % of covid topic
  mutate(election= klda_model$theta[,'3_elections'])%>%
  arrange(-election)

election_df%>%
  select(url)%>%
  slice_head(n=10)
```


And we can use this to track news coverage over time. Here, we see peaks in the "elections" topic that roughly coincide with presidential general election seasons in the time period. 

```{r}
election_coverage<-election_df%>%
  mutate(# round down month
         month = floor_date(date, unit='months'))%>%
  # group by month and compute the monthly average: 
  group_by(month)%>%
  summarise(election= mean(election))

election_coverage%>%
  ggplot(aes(x=month, y=election)) + 
  # create the line plot
  geom_line() +
  geom_point() +
  ylab('Average topic %') +
  # using the minimal theme: 
  theme_minimal() +
  ggtitle('Average % election topic by month') 

```

### Adding metadata

The real feature that distinguishes Keyword assisted topic models from seeded LDA is the ability to incorporate document meta data to predict topic prevalence a very similar fashion to the structural topic model. The syntax is here slightly different, but the basic intuition here is the same as in STM: we think that Fox News and CNN will vary systematically in the amount of attention they give to different subjects.
(note: this will take a while to converge)

```{r}
klda_covariate <- keyATM(
  docs              = keyATM_docs,
  no_keyword_topics = 5,
  keywords          = keys,
  model             = "covariates",
  model_settings    = list(covariates_data    = docvars(pol_dfm),
                           covariates_formula = ~ source),
  options           = list(seed = 250)
)

# save the results after running 
saveRDS(klda_covariate, file='klda_covariate.rds')
```

We can the top words using the same methods as before. It's possible that the results here will have improve somewhat because the topic model is now incorporating data on the source into the metadata. However, the main finding of interest will likely come from viewing the impact of the covariates on the topic probabilities. We can retrieve these with `by_strata_DocTopic()` and plot them using the command below. Here, we're plotting the prevalence for the immigration topic. The results suggest that Fox News is significantly more likely to cover this issue (at least in this sample)

```{r}

#sometimes covariates will be renamed automatically
#use covariates_info(klda_covariate) to see how your covariates might have been renamed
strata_topic <- by_strata_DocTopic(
  klda_covariate, 
  # "source" here was automatically changed to sourceFox News
  by_var = "sourceFox News",
  labels = c("CNN", "Fox")
)

plot(strata_topic, var_name = 'source', show_topic= 7)

```

## BERTopic

Lastly, while probably outside the scope of this course, I do want to highlight a final variation of the topic model known as BERTopic [paper](https://arxiv.org/pdf/2203.05794.pdf). Rather than inferring topics using a document-term-matrix, BERTopic uses an embedding model - a kind of numeric representation of text that captures semantic similarities between words, sentences or documents (good primer on this concept [here](https://www.pinecone.io/learn/series/nlp/dense-vector-embeddings-nlp/)). Importantly, we can use pre-existing embeddings that were trained on a huge corpus and apply them to a new set of documents, which gives us a major head-start when trying to train a model. 



```{r, echo=FALSE}
# load reticulate 
library(reticulate)
# run use_python(path-to-python .exe file) if you need to use a specific version

```


BERTopic generally works better on paragraphs or even single sentences, so we are going to use the same dataset here, but only take the first 5 sentences from each article. Other than that, we really don't need to do *any* pre-cleaning of the data because [the sentence embedding model](https://arxiv.org/abs/1908.10084) we're applying here was trained on regular human sentences and can use word order and context clues when producing numeric representations of sentences. 

```{python, eval=FALSE}

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# read the corpus
poldocs = pd.read_csv('https://github.com/Neilblund/APAN/raw/main/news_sample.csv')

# split the text into sentences 
sentences = [sent_tokenize(text) for text in poldocs.text]
# get the first three sentences from each text
sentences = [' '.join(x[:4]) for x in sentences]


# apply a pre-trained embedding model on the sentences
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(sentences, show_progress_bar=False)


topic_model = BERTopic(embedding_model = embedding_model)

# train the model
topics, probs = topic_model.fit_transform(sentences, embeddings)
 

# save the results in the working directory
topic_model.save("./bertmodel", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

```


The bertopic package offers a number of options for visualizing our results. The first plot shows an inter-topic distance map (similar to the kind provided by LDAVis). Notably: the BERTopic model inferred a much larger number of topics (~60) here than what we set in previous iterations, and many appear to reflect a much finer level of detail. 


```{python}
from bertopic import BERTopic

# loading the saved model
loaded_model = BERTopic.load("./bertmodel")

# visualize topic similarities:
loaded_model.visualize_topics()

loaded_model.visualize_barchart(topics = [ 1,3, 4, 14, 17] )
# getting topic counts by source
classes = r.poldf.source
topics_per_class = loaded_model.topics_per_class(r.poldf.text, classes=classes)

# abortion, ukraine, hunter biden, gender, and jan 6
loaded_model.visualize_topics_per_class(topics_per_class, topics = [ 1,3, 4, 14, 17], normalize_frequency= True)



```



# Other stuff to look at

Many of these examples were adapted from the Quanteda documentation. The package documentation is [here](http://quanteda.io/), and there is (also a tutorial [here](https://tutorials.quanteda.io/introduction/). One important area of text analysis not covered here are fully supervised models, where you take a set of labeled documents and train a model to predict a set of unlabeled documents. While useful and an important part of the text analysis field more broadly, it these models typically require a lot of hand coding before they really begin to be effective. That said, the quanteda tutorial above does go into some detail on how to train a basic text classifier in R. 

If you're really interested in going in depth, one of the best sources on this is [Dan Jurafsky](http://web.stanford.edu/~jurafsky/) who makes slides, lectures, and [even an entire textbook](https://web.stanford.edu/~jurafsky/slp3/) freely available online. 

